Abstract

We study the problem of finding the smallest $m$ for which every element $p$ of an exponential
family $\mathcal{E}$ with finite sample space can be written as a mixture of $m$ elements of another exponential
family $\mathcal{E}'$ as $p = \sum_{i=1}^{m} \alpha_i f_i$, where $f_i \in \mathcal{E}'$, $\alpha_i \geq 0\ \forall i$ and $\sum_{i=1}^{m} \alpha_i = 1$. Our approach is based
on coverings and packings of the face lattice of the corresponding convex support polytopes. We
use the notion of $S$-sets, subsets of the sample space such that every probability distribution that
they support is contained in the closure of $\mathcal{E}$. We find, in particular, that $m = q^{N-1}$ yields the
smallest mixtures of product distributions containing all distributions of $N$ $q$-ary variables, and
that any distribution of $N$ binary variables is a mixture of $m = 2^{N-(k+1)}(1 + 1/(2^k - 1))$ elements
of the $k$-interaction exponential family ($k = 1$ describes product distributions).

1 Introduction 

A mixture model consists of probability distributions which can be written as the convex combination 
of a number of distributions from a specified model. The m-mixture of a model M is the following 
set of probability distributions: 

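$$\operatorname{Mixt}^m(\mathcal{M}) := \Big\{ \sum_{j=1}^{m} \alpha(j) f_j \,:\, f_j \in \mathcal{M},\ \alpha(j) \geq 0\ \forall j \text{ and } \sum_{j=1}^{m} \alpha(j) = 1 \Big\} .$$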

The numbers $\alpha(j) \in \mathbb{R}$ are called mixture weights and the summands $f_j$ mixture components. The
model $\mathcal{M}$ is often an exponential family. There is an abundant literature on mixture models, for
instance [23, 24, 33], and there are many areas of application, for example modeling of heterogeneous
populations, clustering, computer vision, machine learning [6], etc. Identifiability and density
estimation have been studied with the familiar method of moments [27] and the expectation maximization
algorithm [?]. There are results on dispersion of mixtures of exponential families [30] and the
prominent approximation of exchangeable distributions by mixtures of binomial distributions [?]. However,
the geometry of mixture models is not fully understood and there is a lack of knowledge about their
expressive power. How many mixture components from a simple exponential family (e.g. product
distributions) are required to represent or to approximate a more complicated distribution? How many
states of a latent variable are required to explain a stochastic experiment? These questions are
important in model design and model selection involving latent variables, in particular for the study of
Restricted Boltzmann Machines or Deep Belief Networks [25, 26].






G. Montufar 



In this paper we propose to assess the expressive power of mixtures of exponential families by
looking at the combinatorics of the support sets of distributions contained in the closure of these
exponential families. This approach is very natural in the framework of [14], which foregrounds
the combinatorial structure of the boundaries of exponential families, even more in view of recent
advances in this direction [19, 10, ?, ?]. The following notion of support sets is important in our
considerations:

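Definition 1. ($S$-sets) Given a set of probability distributions $\mathcal{M}$ on a finite set $\mathcal{X}$ we say that a set
$\mathcal{Y} \subseteq \mathcal{X}$ is an $S$-set for $\mathcal{M}$ iff every distribution with support $\mathcal{Y}$ is contained in $\overline{\mathcal{M}} \subset \mathbb{R}^{\mathcal{X}}$.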

The $S$ stands for support and for simplex, which is motivated by the fact that the set of all
distributions with support in $\mathcal{Y} \subseteq \mathcal{X}$ is a simplex, a face of the probability simplex on $\mathcal{X}$. Consider two
exponential families $\mathcal{E}$ and $\mathcal{E}'$ with finite sample space $\mathcal{X}$, denote by $\mathcal{P}$ the set of strictly positive
probability distributions on $\mathcal{X}$, and denote by $\overline{\mathcal{E}}$ the topological closure of the set $\mathcal{E} \subset \mathbb{R}^{\mathcal{X}}$. The present
paper is concerned with the following problems:

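(i) Find the smallest possible $F = F(\mathcal{E}, \mathcal{X})$ such that $m \geq F \Rightarrow \operatorname{Mixt}^m(\mathcal{E}) = \mathcal{P}$.

(ii) Find the largest possible $F = F(\mathcal{E}, \mathcal{X})$ such that $\operatorname{Mixt}^m(\mathcal{E}) \supseteq \mathcal{E}' \Rightarrow m \geq F$.

(iii) Given that $\operatorname{Mixt}^m(\overline{\mathcal{E}}) = \overline{\mathcal{P}}$, do we have that $\operatorname{Mixt}^m(\mathcal{E}) = \mathcal{P}$?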

There are many related and interesting problems, for example: Find the smallest $m$ for which $\operatorname{Mixt}^m(\mathcal{E}) =
\operatorname{conv}(\mathcal{E})$. This paper is organized as follows: Section 2 introduces notation and reviews basics about
exponential families. Section 3 explains our approach. Section 4 provides technical results about
coverings and packings using facial sets of hierarchical models. Section 5 contains the main results of
the paper, about the necessary and sufficient number of mixture components from the independence
model and from the $k$-interaction exponential family to represent other families.

2 Preliminaries 

We consider the sample space of $N \in \mathbb{N}$ finite valued variables $\mathcal{X} := \times_{i=1}^{N} \mathcal{X}_i$, where $\mathcal{X}_i$ is a finite set
for every $i \in [N] := \{1, \ldots, N\}$. The case of $q$-ary variables corresponds to $\mathcal{X}_i \cong \{0, \ldots, q-1\}\ \forall i$
and $\mathcal{X} \cong \mathbb{F}_q^N$. For any $\lambda \subseteq [N]$ we denote by $x_\lambda$ an element of $\times_{i \in \lambda} \mathcal{X}_i$ or the natural restriction of
some $x \in \mathcal{X}$. If $\lambda = \{i, i+1, \ldots, i+k\}$ we also write $x_i^{i+k}$ for $x_\lambda$. The expression $[x_\lambda]$ represents a
cylinder set of dimension $(N - |\lambda|)$ which consists of all $y \in \mathcal{X}$ with $y_\lambda = x_\lambda$. In the binary case, the
cylinder sets are in natural correspondence with the (sets of vertices of) faces of the $N$-dimensional
unit cube. The set of real functions on $\mathcal{X}$ is denoted by $\mathbb{R}^{\mathcal{X}}$.

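The probability distributions supported on $\mathcal{X}$ are denoted by $\overline{\mathcal{P}}(\mathcal{X})$, or just $\overline{\mathcal{P}}$ if $\mathcal{X}$ is clear from the
context. Given a reference measure $\nu$ on $\mathcal{X}$ and a linear subspace $V \subseteq \mathbb{R}^{\mathcal{X}}$, we define an exponential
family $\mathcal{E}_{\nu,V}$ on $\mathcal{X}$ in the usual way as the image of the following map:
$\exp_\nu : V \to \mathcal{P} \subset \overline{\mathcal{P}}$; $f \mapsto \nu \exp(f) / \sum_{x \in \mathcal{X}} \nu(x) \exp(f(x))$.
All results presented in this work hold for any $\nu \in \mathbb{R}^{\mathcal{X}}_{>0}$. For
simplicity we set $\nu = 1$ and omit the subscript. A common choice of $V$ are spaces generated by
functions of a limited number of variables: For a collection of interaction sets $\Delta \subseteq 2^{[N]}$ this is
$V_\Delta := \{ \sum_{\lambda \in \Delta} f_\lambda : f_\lambda \in \mathbb{R}^{\mathcal{X}} \text{ s.t. } f_\lambda(x_\lambda, x_{[N] \setminus \lambda}) = f_\lambda(x_\lambda, \tilde{x}_{[N] \setminus \lambda})\ \forall x, \tilde{x} \in \mathcal{X},\ \forall \lambda \in \Delta \}$.
Hierarchical models arise from inclusion complete interaction sets (see e.g. [21]). The choice
$\Delta_k := \{\lambda \subseteq [N] : |\lambda| \leq k\}$ produces the $k$-interaction exponential family $\mathcal{E}^k := \mathcal{E}_{V_{\Delta_k}}$. The independence model is
given by $\mathcal{E}^1$ and describes product distributions. There is a natural hierarchy of nested models
$\mathcal{E}^1 \subset \mathcal{E}^2 \subset \cdots \subset \mathcal{E}^N = \mathcal{P}$ [?]. A matrix $A \in \mathbb{R}^{d \times \mathcal{X}}$ with row span $\mathbb{R}^d \cdot A = V$ is called a sufficient
statistics of $\mathcal{E}_V$. The rows of $A$ are functions on $\mathcal{X}$ called observables. Denoting the columns by
$\{A_x\}_{x \in \mathcal{X}}$ we can write: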

$$\mathcal{E}_V := \{ p_\theta(x) \propto \exp(\langle \theta, A_x \rangle) : \theta \in \mathbb{R}^d \} \subset \mathcal{P} .$$

For convenience we always denote a sufficient statistics by $A$ and the corresponding exponential
family by $\mathcal{E}$. The parametrization of $\mathcal{E}$ given above depends on $A$, but $\mathcal{E}$ itself depends only on $V$.
The relation $\theta \mapsto p_\theta$ is bijective iff the set of all observables plus the constant function $\mathbb{1}(x) = 1$
are linearly independent. There is no loss of generality in assuming that $\mathbb{1}$ is an observable, and
we assume this throughout our considerations. For binary variables a natural choice for the
sufficient statistics of hierarchical models is given by submatrices of the following Hadamard matrix:
$A = (A_{\lambda,x})_{\lambda \in 2^{[N]},\, x \in \mathcal{X}}$, $A_{\lambda,x} := (-1)^{|\operatorname{supp}(x) \cap \lambda|}$. In this case the observables are well studied
functions referred to as characters, and if $\Delta \subseteq 2^{[N]}$ is inclusion complete, then the rows of $A$ indexed by
$\Delta$ build an orthogonal basis of $V_\Delta$.
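The character statistics above can be written down explicitly for small cases. The following is a minimal sketch, assuming binary variables, 0-indexed coordinates and the interaction sets $\Delta_k$; the function name `character_statistics` is hypothetical and not from the paper.

```python
from itertools import combinations, product

def character_statistics(N, k):
    """Rows A[lam][x] = (-1)^{|supp(x) ∩ lam|} for all lam ⊆ [N] with |lam| <= k."""
    states = list(product((0, 1), repeat=N))
    # Interaction sets of cardinality at most k (coordinates are 0-indexed here).
    interaction_sets = [lam for r in range(k + 1) for lam in combinations(range(N), r)]
    rows = [[(-1) ** sum(x[i] for i in lam) for x in states] for lam in interaction_sets]
    return interaction_sets, states, rows

interaction_sets, states, A = character_statistics(N=4, k=2)

# Gram matrix of the rows; distinct characters are orthogonal.
gram = [[sum(a * b for a, b in zip(r, s)) for s in A] for r in A]
```

The empty interaction set produces the constant observable, so the row $\mathbb{1}$ is included automatically, and the off-diagonal entries of `gram` vanish.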

The elements of $\mathcal{E}$ are strictly positive. We denote by $\overline{\mathcal{E}}$ the topological closure of $\mathcal{E} \subset \mathbb{R}^{\mathcal{X}}$. The
convex support (marginal polytope) $Q$ of $\mathcal{E}$ is the image of the moment map: $\overline{\mathcal{P}} \to \mathbb{R}^d$; $p \mapsto A \cdot p$.
This gives the $p$-expectation values of the observables. $A$ maps $\overline{\mathcal{E}}$ homeomorphically onto $Q$, and
$A \cdot p$ is also called expectation parameter vector of $p$ (see [?]). Moreover $Q$ is a compact convex
subset of $\mathbb{R}^d$ with finitely many extreme points, i.e. a convex polytope [16], with the following vertex
representation:

$$Q := \operatorname{conv}\{A_x\}_{x \in \mathcal{X}} \subset \mathbb{R}^d .$$

Not every $A_x$ must be a vertex (extreme point) of $Q$. A face $F$ of $Q$ is the intersection of $Q$ with a
hyperplane of codimension one in $\mathbb{R}^d$ such that all points of $Q$ lie on one of the closed halfspaces defined
through that hyperplane. Since we assume that $\mathbb{1}$ is an observable, $Q$ itself is by this definition also a
face. The set of faces is denoted by $\mathcal{F}(Q)$ and their dimension is given by $\dim(F) := \dim \operatorname{aff}(F)$.
The combinatorial type of $Q$ is the set of all faces $\mathcal{F}(Q)$ together with the partial order given by
inclusion relations. For any $0 \leq g \leq \dim(Q) - 1$ the union of $g$-dimensional faces $\bigcup_{F \in \mathcal{F}(Q) : \dim(F) = g} F$
contains all vertices of $Q$ [17, Theorem 15.1.2]. Any nonsingular affine transformation of a polytope
yields a combinatorially equivalent polytope [?, Theorem 3.2.3] and hence the combinatorial type of
$Q$ depends only on $V$, the row span of $A$. A set $\mathcal{Y} \subseteq \mathcal{X}$ is facial iff $\mathcal{Y} = \{x \in \mathcal{X} : A_x \in F\}$ for some
face $F$ of $Q$. We denote by $\mathcal{F}(\mathcal{E}) \subseteq 2^{\mathcal{X}}$ the set of facial sets of $\mathcal{E}$. A $k$-simplex is the convex hull
of $k$ points in general position in $\mathbb{R}^d$, where $k \leq d + 1$. The line of thought in the present paper is
based on the following result due to Rauh, Kahle and Ay [29], extending results by Geiger, Meek and
Sturmfels [14] to the case of real valued sufficient statistics:

Lemma. A set $\mathcal{Y}$ is the support set of some distribution $p \in \overline{\mathcal{E}}$ iff $\mathcal{Y}$ is facial.

3 S-sets 

In this section we formalize the use of $S$-sets and the combinatorics of support sets to assess the
expressive power of mixtures of exponential families. We can find sufficient and necessary numbers of
mixture components to represent a distribution by deriving bounds on the number of $S$-sets required
to cover the sample space $\mathcal{X}$ or the support of that distribution. We will use this idea for our main
results in Section 5.

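Given an exponential family $\mathcal{E}$ on $\mathcal{X}$ we consider the following function, which gives the minimal
cardinality of a facial packing of any set $\mathcal{Z} \subseteq \mathcal{X}$:

$$\kappa^f_{\mathcal{E}} : 2^{\mathcal{X}} \to \mathbb{N} ;\quad \mathcal{Z} \mapsto \min\{ |\{\mathcal{Y}_i\}_i| : \mathcal{Y}_i \in \mathcal{F}(\mathcal{E}),\ \cup_i \mathcal{Y}_i = \mathcal{Z} \} ,$$

and we set $\kappa^f_{\mathcal{E}}(\mathcal{Z}) = \infty$ if there doesn't exist a facial packing of $\mathcal{Z}$. All $\mathcal{Y}_i$ in this definition must be
subsets of $\mathcal{Z}$. For hierarchical models every $\{x\}$ is facial, provided that $\cup_{\lambda \in \Delta} \lambda = [N]$, and hence
$\kappa^f_k := \kappa^f_{\mathcal{E}^k} < \infty$ if $k > 0$. Clearly, all $S$-sets of $\mathcal{E}$ must be contained in $\mathcal{F}(\mathcal{E})$. We consider the
smallest number of $S$-sets that cover $\mathcal{Z}$, given by the following function:

$$\kappa^s_{\mathcal{E}} : 2^{\mathcal{X}} \to \mathbb{N} ;\quad \mathcal{Z} \mapsto \min\{ |\{\mathcal{Y}_i\}_i| : \mathcal{Y}_i \text{ $S$-set},\ \cup_i \mathcal{Y}_i \supseteq \mathcal{Z} \} ,$$

and we set $\kappa^s_{\mathcal{E}}(\mathcal{Z}) = \infty$ if there doesn't exist an $S$-set covering of $\mathcal{Z}$. If $\kappa$ $S$-sets cover $\mathcal{X}$, then at
most $\kappa$ $S$-sets are needed for packing any $\mathcal{Z} \subseteq \mathcal{X}$, because any subset of an $S$-set is an $S$-set. We
abbreviate $\kappa^s_{\mathcal{E}}(\mathcal{X})$ with $\kappa^s_{\mathcal{E}}$. Given two exponential families $\mathcal{E}$ and $\mathcal{E}'$ we will also consider
$\kappa^f_{\mathcal{E},\mathcal{E}'} := \max_{\mathcal{Z} \in \mathcal{F}(\mathcal{E}')} \kappa^f_{\mathcal{E}}(\mathcal{Z})$, the maximum of $\kappa^f_{\mathcal{E}}$ restricted to the facial sets of $\mathcal{E}'$. The functions defined
above can be easily defined for arbitrary models $\mathcal{M} \subseteq \overline{\mathcal{P}}$, replacing facial sets by support sets of
distributions within $\overline{\mathcal{M}}$.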

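Example 2 (Cylinder $S$-sets). Any distribution $p$ with support contained in a cylinder set $[y_{\lambda^c}]$, $\lambda \subseteq [N]$,
$|\lambda| = k$ is contained in $\overline{\mathcal{E}^k}$. Indeed, if $p \in \overline{\mathcal{P}}$ is arbitrary with support $[y_{\lambda^c}]$, then $p(x) =
\lim_{\alpha \to \infty} \exp(f(x_\lambda) - \alpha \sum_{j \in \lambda^c} g_j(x_j)) / Z$, where $Z$ is the normalization constant, $f(x) = f(x_\lambda)$ is a
function of $k$ variables with $f(x_\lambda) = \log(p(x)) + \log(Z)\ \forall x \in [y_{\lambda^c}]$ and $g_j$ are functions of one
variable taking value $0$ for $x_j = y_j$ and $1$ otherwise. Therefore, the $k$-dimensional cylinder sets are
$S$-sets for $\mathcal{E}^k$. If $\mathcal{X} = \mathbb{F}_q^N$, then $\kappa^s_{\mathcal{E}^k} \leq q^{N-k}$.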

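We can assess the expressive power of mixture models using $S$-sets and comparing the support sets
of distributions from different models. The following lemma describes this very natural observation:

Lemma 3. Consider two exponential families $\mathcal{E}, \mathcal{E}' \subseteq \mathcal{P}(\mathcal{X})$.

• If $m \geq \kappa^s_{\mathcal{E}}$ and $\kappa^s_{\mathcal{E}} < \infty$, then $\operatorname{Mixt}^m(\mathcal{E}) = \mathcal{P}$.

• $\operatorname{Mixt}^m(\mathcal{E}) \supseteq \mathcal{E}'$ implies $m \geq \kappa^f_{\mathcal{E},\mathcal{E}'}$.

In particular this entails that $\operatorname{Mixt}^m(\mathcal{E}) = \mathcal{P}$ implies $m \geq \max \kappa^f_{\mathcal{E}}$, and that $\kappa^f_{\mathcal{E},\mathcal{E}'} = \infty$ implies
$\operatorname{conv}(\mathcal{E}) \not\supseteq \mathcal{E}'$. Clearly, the lemma can be formulated for arbitrary models. In that case however, the
implication of the first item is weaker: If $m \geq \kappa^s_{\mathcal{M}}$, then $\operatorname{Mixt}^m(\overline{\mathcal{M}}) = \overline{\mathcal{P}}$.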

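Proof of Lemma 3. 1. Let $\{\mathcal{Y}_i\}_{i=1}^{m}$ be an $S$-set covering of $\mathcal{X}$. W.l.o.g. $\mathcal{Y}_i \cap \mathcal{Y}_j = \emptyset\ \forall i \neq j$. Any $p \in
\overline{\mathcal{P}}$ can be written as $\sum_{i=1}^{m} \alpha_i f_i$ with $f_i \in \overline{\mathcal{E}}$, choosing $f_i$ with $\operatorname{supp}(f_i) \subseteq \mathcal{Y}_i$, $f_i = p|_{\mathcal{Y}_i} / \sum_{x \in \mathcal{Y}_i} p(x)$
and $\alpha_i = \sum_{x \in \mathcal{Y}_i} p(x)$. Hence $\operatorname{Mixt}^m(\overline{\mathcal{E}}) = \overline{\mathcal{P}}$. For strictly positive distributions: The distributions in $\mathcal{E}$
are strictly positive, therefore $\operatorname{Mixt}^m(\mathcal{E}) \subseteq \mathcal{P}$. For the other direction: Let $Y_i := \overline{\mathcal{P}}(\mathcal{Y}_i)$. The sets $Y_i$
are disjoint faces of $\overline{\mathcal{P}}$ whose union covers all point measures $\{\delta_x\}_{x \in \mathcal{X}}$. We have seen that the mixture
map $\phi : D := \overline{\mathcal{P}}_m \times (\times_{i=1}^{m} Q) \to \overline{\mathcal{P}}$; $(\alpha, \eta_1, \ldots, \eta_m) \mapsto \sum_{i=1}^{m} \alpha_i p_{\eta_i}$ is surjective, where
$p_\eta := (A|_{\overline{\mathcal{E}}})^{-1}(\eta)$. It is easy to check that the restriction $\phi|_C : C \to \partial\mathcal{P}$, with $C := \partial(\overline{\mathcal{P}}_m \times (\times_{i=1}^{m} A \cdot Y_i))$,
is a continuous bijection between the compact domain $C$ and the Hausdorff codomain $\partial\mathcal{P}$. Therefore
$\phi|_C$ is a homeomorphism and induces isomorphisms between the homotopy groups of $C$ and those
of $\partial\mathcal{P} \cong S^{|\mathcal{X}|-2}$ ($S^{|\mathcal{X}|-2}$ denotes the $(|\mathcal{X}| - 2)$-sphere). Notice that $\phi(D) \subseteq \overline{\mathcal{P}}$. For any $\epsilon > 0$ we
find a continuous deformation $C \to C_\epsilon \subset D$ which is mapped by $\phi$ into a continuous deformation
$\partial\mathcal{P} \to \phi(C_\epsilon) \subset \mathcal{P} \setminus \mathcal{P}_\epsilon$, $\mathcal{P}_\epsilon := \{p \in \mathcal{P} : p(x) > \epsilon\ \forall x \in \mathcal{X}\}$. If $\phi(D)$ didn't contain $\mathcal{P}_\epsilon$ then $\phi(C_\epsilon)$
wouldn't be contractible in $\phi(D)$, in contradiction to the fact that $D$ is contractible. Obviously any
element of $\mathcal{P}$ belongs to some $\mathcal{P}_\epsilon$. Hence, $\operatorname{Mixt}^m(\mathcal{E}) \supseteq \mathcal{P}$.

2. Consider some $p \in \overline{\mathcal{E}'}$ with a support $\mathcal{Z} \in \mathcal{F}(\mathcal{E}')$. If $p$ is written as a mixture of elements from
$\overline{\mathcal{E}}$, then every summand with positive mixture weight must have a support $\mathcal{Y} \in \mathcal{F}(\mathcal{E})$ with $\mathcal{Y} \subseteq \mathcal{Z}$.
Furthermore, the union of the support sets of these summands must be $\mathcal{Z}$. The minimal number is
precisely $\kappa^f_{\mathcal{E}}(\mathcal{Z})$. □

Figure 1: Left: Simplex of probability distributions on $\{0, 1\}^2$, the set of product distributions $\mathcal{E}^1$
and the convex support $Q_1$. The facial sets (support sets of distributions from $\overline{\mathcal{E}^1}$) are $\mathcal{X}$, as well as the
pairs $\{(0, 0), (0, 1)\}$, $\{(0, 1), (1, 1)\}$, $\{(1, 1), (1, 0)\}$, $\{(1, 0), (0, 0)\}$ (the edges of the 2-cube), and
the individual elements of $\mathcal{X}$ (vertices of the 2-cube). The $S$-sets are the edges of the 2-cube and
vertices. Right: Schlegel diagram of the four dimensional probability simplex on $\{0, \ldots, 4\}$ and the
corresponding projection of a two dimensional exponential family $\mathcal{E}_{\pentagon}$ with convex support $Q_{\pentagon}$ given
by the depicted pentagon. The color indicates the value that the distributions take on $x = 4$; blue for
$p(4) = 0$ and red for $p(4) = 1$. The uniform distribution as well as $\delta_4$ are projected into the same
point.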

Example 4 (Mixtures of two independent binary variables). Amari [2] showed $\operatorname{Mixt}^2(\mathcal{E}^1) = \mathcal{P}(\{0, 1\}^2)$.
The set of mixtures of two fixed distributions $p$ and $q$ on $\mathcal{X}$ can be represented as a straight interval in
$\overline{\mathcal{P}}$ connecting the two points. If $p$ and $q$ are moved freely within $\mathcal{E}^1$, the set of intervals connecting
them fills the entire probability simplex (see Figure 1, left). The situation is less clear for more than
2 variables, e.g. $\dim(\mathcal{E}^1) = N \approx \log_2(\dim(\mathcal{P}))$ for $N$ variables. Close inspection reveals that
mixtures of two elements from the intervals $[\delta_{(0,1)}, \delta_{(1,1)}]$ and $[\delta_{(1,0)}, \delta_{(0,0)}]$, which belong to $\overline{\mathcal{E}^1}(\{0, 1\}^2)$,
already suffice to fill $\overline{\mathcal{P}}$. These intervals correspond to a partition of $\mathcal{X}$ into two $S$-sets.
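The decomposition used in Example 4 (and in the first part of the proof of Lemma 3) can be carried out by hand for any distribution on $\{0,1\}^2$. The following is a minimal sketch under that reading; the helper names `split_on_edges` and `is_product` and the sample distribution `p` are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction as F

def split_on_edges(p):
    """Split p on {0,1}^2 into components supported on the two S-sets
    {(0,1),(1,1)} and {(1,0),(0,0)} (two opposite edges of the square)."""
    edges = [((0, 1), (1, 1)), ((1, 0), (0, 0))]
    weights, components = [], []
    for edge in edges:
        mass = sum(p[x] for x in edge)
        if mass == 0:
            continue  # this edge carries no mass, so no component is needed
        weights.append(mass)
        components.append({x: (p[x] / mass if x in edge else F(0)) for x in p})
    return weights, components

def is_product(f):
    """On {0,1}^2, f factorizes iff its 2x2 table has rank one (boundary included)."""
    return f[(0, 0)] * f[(1, 1)] == f[(0, 1)] * f[(1, 0)]

# Hypothetical example distribution; it is not itself a product distribution.
p = {(0, 0): F(1, 2), (0, 1): F(1, 8), (1, 0): F(1, 8), (1, 1): F(1, 4)}
weights, comps = split_on_edges(p)
mixture = {x: sum(w * f[x] for w, f in zip(weights, comps)) for x in p}
```

Each component is the restriction of `p` to one edge, renormalized, and the mixture weights are the masses of the edges, exactly as in the choice of $f_i$ and $\alpha_i$ in the proof of Lemma 3.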

The following result will help us estimate $\kappa^s_{\mathcal{E}}$ for specific choices of $\mathcal{E}$:

Lemma 5. ($S$-sets of exponential families) Consider an exponential family $\mathcal{E} \subseteq \mathcal{P}(\mathcal{X})$. The
following items are equivalent:

• Every probability distribution with support $\mathcal{Y} \subseteq \mathcal{X}$ is in $\overline{\mathcal{E}}$, i.e. $\mathcal{Y}$ is an $S$-set.

• $\mathcal{Y}$ is facial and $\operatorname{conv}\{A_y\}_{y \in \mathcal{Y}}$ is a $(|\mathcal{Y}| - 1)$-dimensional simplex.

• $\operatorname{supp}(m^\pm) \not\subseteq \mathcal{Y}\ \forall m \in \ker(A) \setminus \{0\}$, where $m^\pm(x) = \max\{0, \pm m(x)\}\ \forall x \in \mathcal{X}$.

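Proof. The first item implies the second because the linear map $A$ is a bijection on the simplex $\overline{\mathcal{P}}(\mathcal{Y})$.
For the other direction: The matrix $A_{\mathcal{Y}} := (A_y)_{y \in \mathcal{Y}}$ defines an exponential family $\mathcal{E}_{\mathcal{Y}}$ with $\overline{\mathcal{E}_{\mathcal{Y}}} = \overline{\mathcal{E}} \cap \overline{\mathcal{P}}(\mathcal{Y})$,
because $\mathcal{Y}$ is facial. If $\operatorname{conv}\{A_y\}_{y \in \mathcal{Y}}$ is a $(|\mathcal{Y}| - 1)$-simplex, then all columns of $A_{\mathcal{Y}}$ are affinely
independent. In fact they are linearly independent ($\mathbb{1}$ is a row of $A$), and $\ker A_{\mathcal{Y}} = \{0\}$. Hence,
any $p \in \overline{\mathcal{P}}(\mathcal{Y})$ trivially satisfies $\prod_x (p(x))^{m^+(x)} - \prod_x (p(x))^{m^-(x)} = 0\ \forall m \in \ker A_{\mathcal{Y}}$, which implies
$p \in \overline{\mathcal{E}_{\mathcal{Y}}}$ [?]. The third item is equivalent to: $\mathcal{Y}$ is facial [29] and additionally $\operatorname{supp}(m) \not\subseteq \mathcal{Y}\ \forall m \in
\ker(A) \setminus \{0\}$. This implies $\ker A_{\mathcal{Y}} = \{0\}$. See the Appendix for more details. □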

Example 6 (Mixtures of an $n$-gon exponential family). Let $\mathcal{X} = \{0, \ldots, n - 1\}$ and let $\mathcal{E}$ be an
exponential family with convex support given by an $n$-gon (a polygon with $n$ vertices). This kind
of family is interesting in the context of model design [5], since such families are only two-dimensional
and contain every point measure $\delta_x$ in their closure. Without loss of generality we assume that the
boundary of $Q$ is given by the polyline $A_0 A_1 \cdots A_{n-1} A_0$. The facial sets are: $\mathcal{X}$, the pairs $\{i, i + 1\}$
mod $n$ and the points $\{i\}_{i \in \mathcal{X}}$. All facial sets but $\mathcal{X}$ are $S$-sets. It is not difficult to see that $\mathcal{X}$ is covered
by $\kappa^s_{\mathcal{E}} = \lceil n/2 \rceil$ $S$-sets, while the packing of any set $\mathcal{Y} \subset \mathcal{X}$ requires at most $\max \kappa^f_{\mathcal{E}} = \lfloor n/2 \rfloor$ facial sets.
By Lemma 3 the smallest $m$ for which $\operatorname{Mixt}^m(\mathcal{E}) = \operatorname{conv}(\mathcal{E}) = \mathcal{P}$ satisfies $\lfloor n/2 \rfloor \leq m \leq \lceil n/2 \rceil$. In the
case $n = 5$ (see Figure 1, right) we can show that $m \geq 2 = \lfloor 5/2 \rfloor$ is necessary and sufficient:

Proposition 7. $\operatorname{Mixt}^2(\mathcal{E}_{\pentagon}) = \mathcal{P}$.

Proof. See the Appendix. □ 
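The two counts in Example 6 can be made concrete. The sketch below builds one edge cover of the $n$-gon vertices and one vertex set that can only be packed by singletons; the function names are hypothetical, and the code only illustrates the counting for $n \geq 4$, it does not prove minimality.

```python
def edge_cover(n):
    """Cover the vertices 0..n-1 of an n-gon by edges {i, i+1 mod n} (S-sets)."""
    cover = [{i, (i + 1) % n} for i in range(0, n - 1, 2)]
    if n % 2 == 1:
        cover.append({n - 1})  # one vertex is left over when n is odd
    return cover

def alternating_vertices(n):
    """A vertex set without cyclically adjacent pairs; only singletons can pack it."""
    return set(range(0, n - 1, 2))

cover5 = edge_cover(5)          # two edges and one single vertex
alt5 = alternating_vertices(5)  # two non-adjacent vertices
```

For $n = 5$ this gives a cover with $\lceil 5/2 \rceil = 3$ sets and a set of $\lfloor 5/2 \rfloor = 2$ pairwise non-adjacent vertices.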



4 Decompositions of the Sample Space 

In view of Lemma 5 we can find $\kappa^s_{\mathcal{E}}$ if we determine the simplex faces of $Q$ and how many such faces
suffice to cover all vertices. This has the flavor of a covering code problem, which can be difficult,
e.g. finding a minimum clique cover is a graph-theoretical NP-complete problem, and perfect
covering codes on $\{0, 1\}^N$ are not completely understood (see [?]). On the other hand, especially for
hierarchical models, the convex support is highly structured. We focus now on $S$-set coverings and
facial packings for hierarchical models.

We denote by $Z_\pm$ the binary vectors of even (odd) parity:

$$Z_\pm := \Big\{ x \in \mathcal{X} = \{0, 1\}^N : \prod_{i \in [N]} (-1)^{x_i} = \pm 1 \Big\} . \qquad (1)$$

The following example will help us to illustrate some results: 

Example 8. The convex support of $\mathcal{E}^2(\{0, 1\}^4)$, $Q_2$, is a polytope of dimension 10 and has 16
vertices. We used Polymake [13] to compute the face lattice of $Q_2$. It has 56 facets (proper faces of
maximal dimension). From these, 16 contain only 10 vertices and are simplices. One of the
corresponding $S$-sets is the following: $\mathcal{Y} = \{(0000), (1000), (0100), (0010), (1001), (0101), (0011),
(1101), (1011), (0111)\}$. In total 8 $S$-sets contain 6 elements from $Z_+$ and 8 contain 6 elements from
$Z_-$. The other 40 facets have 12 vertices each. Denote the $S$-sets (of cardinality 10) by $\{F_i\}$ and the
remaining facial sets (of cardinality 12) by $\{G_i\}$. We found that $F_i \cup F_j \neq \mathcal{X}\ \forall i, j$ and $F_i \cup G_j \neq \mathcal{X}$
$\forall i, j$. Since all faces (facial sets) and in particular all simplex faces ($S$-sets) must be subsets of some
facet, these computations show that a minimal covering of $\mathcal{X}$ using $S$-sets needs at least 3 faces.
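Part of Example 8 can be rechecked without Polymake. The following sketch verifies only the simplex condition of Lemma 5 for the listed set $\mathcal{Y}$, namely that the corresponding columns of the character sufficient statistics are linearly independent; it does not test faciality, which the example obtains from the face lattice. The helper `rank` is a plain Gaussian elimination written for this illustration.

```python
from fractions import Fraction
from itertools import combinations

def rank(rows):
    """Rank of a rational matrix by Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for c in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                factor = m[i][c] / m[r][c]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

# The set Y listed in Example 8, as binary strings.
Y = ["0000", "1000", "0100", "0010", "1001", "0101", "0011", "1101", "1011", "0111"]

# Interaction sets of cardinality <= 2 on 4 variables (the empty set gives the row 1).
interaction_sets = [lam for r in range(3) for lam in combinations(range(4), r)]

# Column A_y of the character sufficient statistics, one per element of Y.
columns = [[(-1) ** sum(int(y[i]) for i in lam) for lam in interaction_sets] for y in Y]

even = [y for y in Y if y.count("1") % 2 == 0]
odd = [y for y in Y if y.count("1") % 2 == 1]
```

The ten columns have full rank, so their convex hull is a 9-simplex, and the set contains six vectors of odd parity and four of even parity.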



Mixture Decompositions 






A polytope $P \subset \mathbb{R}^d$ is $K$-neighborly if the convex hull of any $K$ or less of its vertices is a face
of $P$ [?, 31]. The following result by T. Kahle gives some information about $S$-sets of hierarchical
models:

Theorem ([19, Theorem 13]). Let $\Delta \subseteq 2^{[N]}$ and let $(k + 1)$ be the minimal cardinality of a set in
$2^{[N]} \setminus \Delta$, then every distribution $p$ with $|\operatorname{supp}(p)| < 2^k$ is in $\overline{\mathcal{E}_\Delta}$. $Q_\Delta$ is $(2^k - 1)$-neighborly.

For binary variables the maximal neighborliness degree of $Q_k$ is $(2^k - 1)$. In this sense [19,
Theorem 13] is optimal. There exist sets of cardinality $2^k$ which are not $S$-sets of $\mathcal{E}^k$.

Proposition 9. Let $\mathcal{X} = \{0, 1\}^N$ and $0 < k < N$. Consider $\mathcal{E}^k$ and any $y_{\lambda^c} \in \mathcal{X}_{\lambda^c}$, $\lambda \subseteq [N]$, $|\lambda| =
k + 1$. Clearly $|[y_{\lambda^c}] \cap Z_\pm| = 2^k$. Any $\mathcal{Y} \subseteq \mathcal{X}$ containing $[y_{\lambda^c}] \cap Z_\pm$ is not an $S$-set. If $\mathcal{Y} \supseteq [y_{\lambda^c}] \cap Z_\pm$
and $\mathcal{Y} \not\supseteq [y_{\lambda^c}]$, then $\mathcal{Y}$ is not facial.

Proof. See the Appendix. □ 

Note that the $S$-sets described by [19, Theorem 13] are not the only $S$-sets of $\mathcal{E}^k$. For example,
for $q$-ary variables we have that the cylinder sets from Example 2 are $S$-sets of $\mathcal{E}^k$ of cardinality
$q^k > 2^k - 1$. The following classic result of convex polytopes is a helpful observation for our
purposes:

Theorem ([?, Theorem 7.4.3]). Let $Q$ be a $K$-neighborly $d$-dimensional polytope. Then, every face
$F$ of $Q$ with $0 \leq \dim F < 2K$ is a simplex, i.e., $Q$ is $(2K - 1)$-simplicial.

This implies that all $(2K - 1)$-dimensional faces of $Q_k$ are simplices, where $K := 2^k - 1$. If
$K > \lfloor \frac{1}{2}(\dim Q_k) \rfloor$, then $Q_k$ is a simplex (this only occurs when $k = N$ and $\mathcal{E}^N = \mathcal{P}$). All support
sets of distributions within $\overline{\mathcal{E}^k}$ with a cardinality $2K$ or less are $S$-sets. In the following we focus on
the binary case and provide an analysis that will be used in Section 5 to show that it is possible to
cover most vertices of $Q_k$ using disjoint $(2K - 1)$-dimensional faces. In Section 5 we will return to
the non-binary case.

For $0 < k < N$, any $(k + 1)$-dimensional cylinder set $[y_{\lambda^c}]$, $\lambda \subseteq [N]$, $|\lambda| = k + 1$ is facial for $\mathcal{E}^k$
(similar arguments as in Example 2) and hence, the vertices of $Q_k$ can be covered by $2^{N-(k+1)}$ disjoint
faces $\{F_i\}_i$. $2K$ of the $(2K + 2)$ vertices of each $F_i$ are covered by a simplex face. Any polytope
which is not a simplex always contains two disjoint faces of complementary dimension [8], such that
the two additional vertices can be chosen as an edge of $F_i$. These two vertices in each $F_i$ can be
covered using the fact that $Q_k$ is $K$-neighborly, but also arranging the simplex faces conveniently, as
we will see in Section 5. To this end we use the following lemma, which summarizes properties remarked
in [?]. We provide a thorough proof in the Appendix.

Lemma 10. Let $0 < k < N$ and $\mathcal{X} = \{0, 1\}^N$. Any $(k + 1)$-dimensional cylinder set $\mathcal{Y} \subseteq \mathcal{X}$ is facial
for $\mathcal{E}^k$ and the corresponding face $F$ of the convex support is a simplicial polytope combinatorially
equivalent to the cyclic polytope$^1$ $C(2^{k+1}, 2^{k+1} - 2)$. There are exactly $2^{2k}$ $S$-sets of cardinality
$(2^{k+1} - 2)$ contained in $\mathcal{Y}$. The $S$-sets contained in $\mathcal{Y}$ are $\{\mathcal{Z} \subseteq \mathcal{Y} : \mathcal{Y} \cap Z_\pm \not\subseteq \mathcal{Z}\}$.

Proof. See the Appendix. □ 
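The count $2^{2k}$ in Lemma 10 follows from the stated characterization by a short enumeration. The sketch below takes that characterization as given (it does not test the $S$-set property itself) and identifies the cylinder set with the $(k+1)$-cube; the function name `count_large_s_sets` is hypothetical.

```python
from itertools import combinations, product

def count_large_s_sets(k):
    """Count subsets Z of the (k+1)-cube with |Z| = 2^(k+1) - 2 that contain
    neither all even-parity points nor all odd-parity points."""
    cube = list(product((0, 1), repeat=k + 1))
    even = {x for x in cube if sum(x) % 2 == 0}
    odd = set(cube) - even
    count = 0
    for Z in combinations(cube, len(cube) - 2):
        Z = set(Z)
        if not even <= Z and not odd <= Z:
            count += 1
    return count

counts = {k: count_large_s_sets(k) for k in (1, 2, 3)}
```

A set of this cardinality omits exactly two points, and the condition forces one omitted point of each parity, which gives $2^k \cdot 2^k$ choices.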



$^1$ A $d$-dimensional cyclic polytope with $v$ vertices is defined as the convex hull of $v$ different points on the $d$-moment
curve [12]: $C(v, d) := \operatorname{conv}\{x(t_i)\}_{i \in I, |I| = v}$, where $v \geq d + 1$, $I$ is a linearly ordered set, $i \mapsto t_i$ is a strictly monotone
function, and $x$ is the moment map $x : \mathbb{R} \to \mathbb{R}^d$; $t \mapsto (t, t^2, \ldots, t^d)$.

From Proposition 9 we can derive an upper bound for the cardinality of an $S$-set. Let $K(N, k + 1)$
denote the smallest number of elements from $\mathcal{X}$ needed to mark all $(k + 1)$-dim cylinder sets. Let
$B_{N,k+1}$ denote a Hamming ball of radius $k + 1$.

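Proposition 11. If $\mathcal{Y} \subseteq \mathcal{X}$ is an $S$-set of $\mathcal{E}^k$, then $|\mathcal{Y} \cap Z_\pm| \leq 2^{N-1} - K(N, k + 1) \leq 2^{N-1}(1 -
2/|B_{N,k+1}|)$ and $|\mathcal{Y}| \leq |\Delta_k|$. Since $\mathcal{X}$ is disjointly covered by the two sets $Z_+$ and $Z_-$ we also get
$|\mathcal{Y}| \leq 2^N - 2K(N, k + 1) \leq 2^N(1 - 2/|B_{N,k+1}|)$.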

Proof. See the Appendix. □ 

Example 12. For $N = 4$ and $k = 2$ any $S$-set $\mathcal{Y}$ satisfies $|\mathcal{Y} \cap Z_\pm| \leq 8 - 2 = 6$. In view of Example 8
the bound of Proposition 11 is sharp in this case.

We close this section with a short discussion about symmetries of $Q_\Delta$ and implications for
facial sets and $S$-sets. Observe that if $\Delta$ is symmetric with respect to permutations of the
coordinate indices $\pi : [N] \to [N]$ then also $Q_\Delta$ has this symmetry. If $\mathcal{Y}$ is an $S$-set of $\mathcal{E}^k$, then
$\pi(\mathcal{Y}) := \{(x_{\pi(1)}, \ldots, x_{\pi(N)}) : x \in \mathcal{Y}\}$ is also an $S$-set for any permutation $\pi$. A further symmetry is
given by re-labeling the values of the variables:

Proposition 13. For $\mathcal{E}$ with sufficient statistics $A = ((-1)^{|\operatorname{supp}(x) \cap \lambda|})_{\lambda \in \Delta, x \in \mathcal{X}}$, $\Delta \subseteq 2^{[N]}$, we have
the following: $\mathcal{Y}$ is an $S$-set $\Leftrightarrow$ $x * \mathcal{Y} := \{x + y \bmod 2 : y \in \mathcal{Y}\}$ is an $S$-set $\forall x \in \mathcal{X}$. If $\mathcal{Y} \subseteq \mathcal{X}$ is
facial then also $x * \mathcal{Y}$ is facial for all $x \in \mathcal{X}$.

Proof. See the Appendix. □ 

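It is not difficult to see that the elements of the family $\{x * \mathcal{Y}\}_{x \in \mathcal{X}}$ need not be different from each
other, but they are if $|\mathcal{Y}|$ is odd, or if $\mathcal{Y}$ is a Hamming ball. For any $\mathcal{Y} \subseteq \mathcal{X}$, $\mathcal{Y} \neq \emptyset$ we have that
$\cup_{x \in \mathcal{X}}\, x * \mathcal{Y} = \mathcal{X}$, since $\{x * z : x \in \mathcal{X}\} = \mathcal{X}$ for any $z \in \mathcal{X}$. We also have that $|x * \mathcal{Y}| = |\mathcal{Y}|$ for any
$x \in \mathcal{X}$, $\mathcal{Y} \subseteq \mathcal{X}$, since the map $*$ is invertible, $x * (x * \mathcal{Y}) = \mathcal{Y}$. These observations have interesting
relations to coding theory as discussed in [?].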
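The elementary properties of the translation $x * \mathcal{Y}$ listed above are easy to check on a small case. The sketch below is an illustration only; the set `Y` is a hypothetical example of odd cardinality, and the code says nothing about whether it is an $S$-set.

```python
from itertools import product

def translate(x, Y):
    """x * Y := {x + y mod 2 : y in Y}, taken coordinate-wise."""
    return frozenset(tuple((a + b) % 2 for a, b in zip(x, y)) for y in Y)

N = 3
X = list(product((0, 1), repeat=N))
Y = frozenset({(0, 0, 0), (1, 0, 0), (0, 1, 0)})  # hypothetical example, |Y| odd

translates = [translate(x, Y) for x in X]
```

All translates have the same cardinality as `Y`, together they cover $\mathcal{X}$, applying the same translation twice returns `Y`, and since $|\mathcal{Y}|$ is odd the $2^N$ translates are pairwise different.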



5 Mixtures of Hierarchical Models 

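We start the discussions of this section with $\mathcal{E}^1$ and treat $\mathcal{E}^k$ at the end.

Observation 14 (Support sets and $S$-sets of product distributions). The strictly positive product
distributions on $\mathcal{X} = \times_{i \in [N]} \mathcal{X}_i$, $\mathcal{X}_i$ finite, are given by $\mathcal{E}^1 = \{p \in \mathcal{P}(\mathcal{X}) : p(x_1, \ldots, x_N) =
\prod_i p_i(x_i),\ p_i \in \mathcal{P}(\mathcal{X}_i)\}$. The convex support of $\mathcal{E}^1$ is the cartesian product $Q_1 = \times_i S_i$, where $S_i$ is
a simplex with $|\mathcal{X}_i|$ vertices. The facial sets are the sets of the form $\{\times_{i \in [N]} \mathcal{Y}_i : \mathcal{Y}_i \subseteq \mathcal{X}_i\ \forall i \in [N]\}$.
The $S$-sets are those of the form $\{(x_1, \ldots, x_N) : x_i \in \mathcal{Y}_i,\ x_j = y_j\ \forall j \neq i\}$, for some $i \in [N]$,
$\mathcal{Y}_i \subseteq \mathcal{X}_i$ and $y_j \in \mathcal{X}_j\ \forall j \neq i$.

Let $A_q(N, d) := \max\{|\mathcal{Y}| : \mathcal{Y} \subseteq \mathbb{F}_q^N \text{ s.t. } H(x, y) \geq d\ \forall x, y \in \mathcal{Y},\ x \neq y\}$, where $H$ denotes the
Hamming distance. This function gives the
maximal cardinality of a $q$-ary code of length $N$ and minimum distance $d$. It is a familiar object in
coding theory.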

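Theorem 15. (Mixtures of product distributions on $N$ $q$-ary variables) Let $\mathcal{X} = \mathbb{F}_q^N$.

• $m \geq q^{N-1} \Rightarrow \operatorname{Mixt}^m(\overline{\mathcal{E}^1}) = \overline{\mathcal{P}}$.

• $\operatorname{Mixt}^m(\overline{\mathcal{E}^1}) \supseteq \overline{\mathcal{P}} \Rightarrow m \geq A_q(N, 2)$, where $A_q(N, 2) \geq \frac{q^N}{1 + N(q - 1)}$ and $A_q(N, 2) = q^{N-1}$ if
$q$ is a prime power.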



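Remark 16. (1) Clearly $\overline{\operatorname{Mixt}^m(\mathcal{E})} = \operatorname{Mixt}^m(\overline{\mathcal{E}})$. Theorem 15 implies that a mixture of $q^{N-1}$
elements from $\mathcal{E}^1$ approximates any distribution in $\overline{\mathcal{P}}$ arbitrarily well and that $\overline{\mathcal{P}} \setminus \operatorname{Mixt}^m(\overline{\mathcal{E}^1})$ has
non-empty interior whenever $m < A_q(N, 2)$.

(2) The arbitrarily accurate approximation of arbitrary distributions requires a huge number of product
mixture components, larger than apparent from naive parameter counting ($m \geq \frac{q^N - 1}{N(q - 1)}$).

(3) Our proof in fact yields that if $|\mathcal{X}_i| \neq |\mathcal{X}_j|$ for some $i, j \in [N]$, then $m \geq |\mathcal{X}| / \max_{i \in [N]} |\mathcal{X}_i| \Rightarrow
\operatorname{Mixt}^m(\overline{\mathcal{E}^1}) = \overline{\mathcal{P}}$.

(4) The convex support of the independence model is not two-neighborly. A decomposition of $\mathcal{X}$
based only on neighborliness would yield $|\mathcal{X}|$ components, instead of $|\mathcal{X}| / \max_i |\mathcal{X}_i|$.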

Proof of Theorem 15. We use Lemma 3. Let $\mathcal{X} = \times_{i \in [N]} \mathcal{X}_i$ with arbitrary but finite sets $\mathcal{X}_i$. We
show that $m \geq |\mathcal{X}| / \max_i |\mathcal{X}_i|$ mixtures of $\mathcal{E}^1$ suffice to represent all $\overline{\mathcal{P}}$. The convex support of $\mathcal{E}^1$
is $Q_1 = \times_i S_{|\mathcal{X}_i|}$, where $S_{|\mathcal{X}_i|}$ is a simplex with $|\mathcal{X}_i|$ vertices. W.l.o.g. let $|\mathcal{X}_1| = \max_i |\mathcal{X}_i|$. The
following $|\mathcal{X}| / |\mathcal{X}_1|$ $S$-sets cover $\mathcal{X}$: $\{\{(x, y)\}_{x \in \mathcal{X}_1} : y \in \times_{i \in [N] \setminus \{1\}} \mathcal{X}_i\}$. For the second item: An
edge of $Q_1$ is given by a pair $\{A_x, A_y\}$ with $q$-ary vectors $x$ and $y$ of length $N$ which differ in exactly
one entry, $H(x, y) = 1$. A set $\mathcal{Y} \subseteq \mathcal{X}$ containing no such pair can be packed only using $S$-sets of
cardinality one, because any facial set of cardinality larger than one always contains edges. If the
minimum distance of a code is two, then obviously the code doesn't contain any edges. The
Gilbert-Varshamov bound [?, 34] is:
$A_q(N, d) \geq \frac{q^N}{\sum_{j=0}^{d-1} \binom{N}{j} (q - 1)^j}$, and in the prime power case it reads:
$A_q(N, d) \geq q^k$, where $k$ is the largest integer with $q^k < \frac{q^N}{\sum_{j=0}^{d-2} \binom{N-1}{j} (q - 1)^j}$. On the other hand we
have the Singleton bound [32]: $A_q(N, d) \leq q^{N-d+1}$. For $d = 2$ the combination of the two bounds
completes the proof. □
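Both ingredients of the proof can be illustrated on small sample spaces. The sketch below builds the $S$-set covering by "lines" along a largest factor, and, as an assumption of this illustration rather than the construction used in the paper, the zero-sum code $\{x : \sum_i x_i \equiv 0 \bmod q\}$, which is one standard code of size $q^{N-1}$ with minimum distance 2. All function names are hypothetical.

```python
from itertools import product

def line_cover(sizes):
    """Cover X = X_1 x ... x X_N by S-sets that vary one largest coordinate
    and fix all the others, one set for each choice of the fixed values."""
    order = sorted(range(len(sizes)), key=lambda i: -sizes[i])
    largest, rest = order[0], order[1:]
    cover = []
    for y in product(*(range(sizes[i]) for i in rest)):
        line = set()
        for x in range(sizes[largest]):
            point = [None] * len(sizes)
            point[largest] = x
            for i, yi in zip(rest, y):
                point[i] = yi
            line.add(tuple(point))
        cover.append(line)
    return cover

def zero_sum_code(q, N):
    """All x in {0..q-1}^N whose entries sum to 0 mod q (illustrative choice of code)."""
    return [x for x in product(range(q), repeat=N) if sum(x) % q == 0]

def hamming(x, y):
    """Number of entries in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

cover = line_cover([2, 3, 2])
code = zero_sum_code(3, 3)
```

Here `cover` consists of $|\mathcal{X}| / \max_i |\mathcal{X}_i| = 12/3$ sets, and `code` has $3^{2}$ words, no two of which form an edge of $Q_1$.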

In the case of binary variables we have the following: 

Observation 17. For $\mathcal{X} = \{0, 1\}^N$ the convex support $Q_1$ is a combinatorial $N$-cube, denoted $C_N$. $\mathcal{Y}$
supports a distribution in $\overline{\mathcal{E}^1}$ iff $\mathcal{Y}$ is a cylinder set, i.e. a face of $C_N$. $\mathcal{Y}$ is an $S$-set iff it has cardinality
one, or consists of two binary vectors differing in exactly one entry.

It is not difficult to check that the two complementary sets of even and odd parity from equation (1),
$Z_+$ and $Z_-$, are perfect binary codes of length $N$ and minimum distance 2, i.e., they are sets of binary
vectors where any two elements differ in at least two entries and have the maximal cardinality that
such a set can have, $|Z_\pm| = A_2(N, 2) = 2^{N-1}$. The binary formulation of Theorem 15
reads:

Corollary 18. (Mixtures of product distributions on binary variables) Let $\mathcal{X} = \{0, 1\}^N$.

• Any $p \in \overline{\mathcal{P}}$ with support contained in the union of $\kappa$ edges of the $N$-cube is contained in
$\operatorname{Mixt}^\kappa(\overline{\mathcal{E}^1})$. Hence, $m \geq 2^{N-1} \Rightarrow \operatorname{Mixt}^m(\overline{\mathcal{E}^1}) = \overline{\mathcal{P}}$.

• Any representation of a $p \in \overline{\mathcal{P}}$ with $\operatorname{supp}(p) \subseteq Z_\pm$ as mixture of product distributions has
at least $|\operatorname{supp}(p)|$ components with support of cardinality one. Hence, $\operatorname{Mixt}^m(\overline{\mathcal{E}^1}) \supseteq \overline{\mathcal{P}} \Rightarrow
m \geq 2^{N-1}$.

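A consequence of this is the following:

Corollary 19. Let $1 \leq j \leq N - 1$. $\operatorname{Mixt}^m(\overline{\mathcal{E}^1}) \supseteq \mathcal{E}^j$ implies
$m \geq \max\big\{ \kappa^f_{\mathcal{E}^1,\mathcal{E}^j},\ \big\lceil \frac{\dim(\mathcal{E}^j) + 1}{N + 1} \big\rceil \big\}$. We
have that $\kappa^f_{\mathcal{E}^1,\mathcal{E}^j} \geq \max\{|\mathcal{Z}| : \mathcal{Z} \in \mathcal{F}(\mathcal{E}^j),\ \mathcal{Z} \subseteq Z_\pm\} \geq 2^j - 1$.

The dimension of $\mathcal{E}^j$ is given by $\dim \mathcal{E}^j = \sum_{i=1}^{j} \binom{N}{i}$.

Proof. If $\operatorname{Mixt}^m(\overline{\mathcal{E}^1}) \supseteq \mathcal{E}^j$ it is necessary that $\dim(\operatorname{Mixt}^m(\mathcal{E}^1)) \geq \dim \mathcal{E}^j$. Parameter counting
yields $mN + m - 1 \geq \dim(\operatorname{Mixt}^m(\mathcal{E}^1)) \geq \dim \mathcal{E}^j$. The quantity $\max\{|\mathcal{Z}| : \mathcal{Z} \in \mathcal{F}(\mathcal{E}^j) \text{ and } \mathcal{Z} \subseteq
Z_\pm\}$ is lower bounded by $2^j - 1$, which can be seen from [19, Theorem 13]. □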



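Example 20. For $\mathcal{X} = \{0, 1\}^4$ and $\mathcal{E}^2$ Corollary 19 yields (see Example 8): $\operatorname{Mixt}^m(\overline{\mathcal{E}^1}) \supseteq \mathcal{E}^2 \Rightarrow
m \geq 6$. For $N = 4$ and $k = 3$, $Q_k$ has dimension 14 and $\lceil (\dim \mathcal{E}^3 + 1)/(4 + 1) \rceil = 3$. On the other
hand $2^3 - 1 = 7$, and Corollary 19 yields: $\operatorname{Mixt}^m(\overline{\mathcal{E}^1}) \supseteq \mathcal{E}^3 \Rightarrow m \geq 7$.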
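The arithmetic in Example 20 can be reproduced directly from the dimension formula and the parameter-counting inequality $mN + m - 1 \geq \dim \mathcal{E}^j$. The sketch below uses only the general lower bound $2^j - 1$ for the combinatorial term, so for $j = 2$ it returns a weaker value than the $m \geq 6$ obtained from the face lattice in Example 8; the function names are hypothetical.

```python
from math import comb

def dim_k_interaction(N, j):
    """dim E^j = sum_{i=1}^{j} binom(N, i) for N binary variables."""
    return sum(comb(N, i) for i in range(1, j + 1))

def counting_bound(N, j):
    """Smallest integer m with m*N + m - 1 >= dim E^j."""
    return -(-(dim_k_interaction(N, j) + 1) // (N + 1))

def general_lower_bound(N, j):
    """max of the parameter-counting bound and the combinatorial bound 2^j - 1."""
    return max(counting_bound(N, j), 2 ** j - 1)

bound_e3 = general_lower_bound(4, 3)
```

For $N = 4$ and $j = 3$ the counting bound is 3 while the combinatorial bound is 7, so the combinatorial term dominates.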

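Now we turn our attention to mixtures from $\mathcal{E}^k$. Let $\kappa_k(\mathcal{Y})$ denote the minimal cardinality of
a covering of $\mathcal{Y}$ using cylinder sets of dimension $k$. Example 2, [19, Theorem 13] and Lemma 3
immediately yield the following:

Proposition 21. Let $\mathcal{X} = \times_{i \in [N]} \mathcal{X}_i$. Consider any $p \in \overline{\mathcal{P}}$ and let $\kappa := \min\big\{ \big\lceil \frac{|\operatorname{supp}(p)|}{2^k - 1} \big\rceil,\ \kappa_k(\operatorname{supp}(p)) \big\}$.
Then the following holds: $m \geq \kappa \Rightarrow p \in \operatorname{Mixt}^m(\overline{\mathcal{E}^k})$.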



The following Theorem 22 is a stronger result for the case of binary variables. We use the $S$-sets
of cardinality $2(2^k - 1)$ from Lemma 10.

Theorem 22. (Mixtures of $\mathcal{E}^k$) Let $\mathcal{X} = \{0, 1\}^N$. There exist $\kappa := 2^{N-(k+1)}\big(1 + \frac{1}{2^k - 1}\big)$ simplex
faces of $Q_k$ which cover all vertices of $Q_k$. Therefore, $m \geq \kappa \Rightarrow \operatorname{Mixt}^m(\overline{\mathcal{E}^k}) = \overline{\mathcal{P}}$.



Remark 23. (1) Obviously Theorem 22 holds for any $\mathcal{E} \supseteq \mathcal{E}^k$, e.g. for any $\mathcal{E}_\Delta$ with $\Delta \supseteq \Delta_k$.
The specified number of mixture components from $\mathcal{E}^k$ suffices to approximate any element from $\overline{\mathcal{P}}$
arbitrarily well. This halves the bound given in Proposition 21 for the case of full support binary
distributions.

(2) For $k = 1$ we have that $\kappa = 2^{N-1}$. Therefore, Theorem 22 includes the first part of Corollary 18
for binary variables.

(3) For $N = 4$ and $k = 2$ the cardinality bound for a simplex face covering of the vertices of $Q_k$ given
in Theorem 22 is $\lceil 2^{4-(2+1)} / (1 - 2^{-2}) \rceil = 3$. This is optimal in view of Example 8.

(4) As a consequence we have: For any $p \in \overline{\mathcal{P}}(\{0, 1\}^N)$, if $m \geq \kappa_{k+1}(\operatorname{supp}(p))\big(1 + \frac{1}{2^k - 1}\big)$, then
$p \in \operatorname{Mixt}^m(\overline{\mathcal{E}^k})$.
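The values quoted in Remark 23 follow from the formula of Theorem 22 by exact arithmetic. The following is a small sketch; the function name `theorem_22_bound` is hypothetical, and the rounding up to an integer number of components is applied outside the function, as in item (3).

```python
from fractions import Fraction
from math import ceil

def theorem_22_bound(N, k):
    """kappa = 2^(N-(k+1)) * (1 + 1/(2^k - 1)) as an exact fraction, for 1 <= k < N."""
    return Fraction(2) ** (N - (k + 1)) * (1 + Fraction(1, 2 ** k - 1))

kappa_4_2 = theorem_22_bound(4, 2)   # the case of item (3)
kappa_5_1 = theorem_22_bound(5, 1)   # the case k = 1 of item (2)
```

The same value can be written as $2^{N-(k+1)} / (1 - 2^{-k})$, which is the form used in item (3).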

Proof of Theorem [22] Consider the following partition of X into (k + l)-dim cylinder sets: 

{Cy} y := {{x\ + \ < 2 ) G {0, 1}" : < 2 = 2/}, e{0)1}J v-( fc+1 ) 



By Lemma 10 for any y e {0, 1}^ the elements of C y are disjointly covered by: 

(i) An S'-set of £ fc of cardinality 2K. We denote this set by G y . 

( ii) A pair E y , which can be chosen to be any edge of C y (a pair differing in one entry), in particular: 



E y = {(zlxk^y) e {0,1}^: z\ fixed } . 
The vector z can be chosen to be the same for all E y , such that the 5-sets {G y } y satisfy: 



(2) 



u 



{0,1}^ 



N-k j 



yG{0,l} iV -( fe + 1 ) 
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where C^-k is the following (N — k) -dimensional cylinder set: 



C 



N-k 



u 



{{zl^- k ) affixed} 



2/G{0,l} iV -( fc + 1 ) 



The set $C_{N-k}$ can be considered as a new sample space which still has to be covered using $S$-sets. If $N - k < k + 1$, only one $S$-set is required. Iteration until exhausting all coordinates yields that $\kappa$, the minimal number of faces of $Q_k$ which are simplices and suffice to cover all vertices, is not more than:

$$\kappa \leq 1 + \sum_{0 \leq i \leq \frac{N-(k+1)}{k}} \frac{2^{N-ik}}{2^{k+1}} \leq \left\lceil 2^{N-(k+1)} \sum_{i=0}^{\infty} \frac{1}{(2^k)^i} \right\rceil = \left\lceil \frac{2^{N-(k+1)}}{1 - 2^{-k}} \right\rceil .$$



□ 
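The counting argument of the proof can be replayed numerically. The sketch below is not part of the original argument: `iterative_cover_count` (a hypothetical name) only counts how many $S$-sets the recursive construction uses, one per $(k+1)$-dimensional cylinder set at each stage and one final set for the remainder, and `closed_form` evaluates the ceiling of the geometric-series expression. It does not construct the $S$-sets themselves.

```python
from fractions import Fraction
from math import ceil


def iterative_cover_count(N: int, k: int) -> int:
    """Number of S-sets used by the recursive covering in the proof.

    Assumes 1 <= k <= N - 1.
    """
    if not (1 <= k <= N - 1):
        raise ValueError("expected 1 <= k <= N - 1")
    count = 0
    n = N  # coordinates of the sample space that still has to be covered
    while n >= k + 1:
        count += 2 ** (n - (k + 1))  # one S-set G_y per cylinder set C_y
        n -= k                       # the left-over edges form an (n-k)-dim cylinder
    return count + 1                 # one last S-set covers what remains


def closed_form(N: int, k: int) -> int:
    """Ceiling of 2^(N-(k+1)) / (1 - 2^(-k)), computed exactly."""
    return ceil(Fraction(2 ** (N - (k + 1))) / (1 - Fraction(1, 2 ** k)))


if __name__ == "__main__":
    for N in range(2, 9):
        for k in range(1, N):
            print(N, k, iterative_cover_count(N, k), closed_form(N, k))
```

For the parameter ranges tried here the two quantities coincide, which is consistent with the displayed bound.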



Appendix 

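Proof of Proposition. Assume, as always in this paper, that the sufficient statistics contains the row $\mathbb{1}$. The image of the map $\pi : p \mapsto A \cdot p$ restricted to $\overline{\mathcal{P}}$ is the convex support $Q := \operatorname{conv}\{A_x : x \in \mathcal{X}\}$. Since $\pi$ is continuous, $\overline{\mathcal{E}}$ is compact, and $Q$ is Hausdorff, this bijective restriction is a homeomorphism. We denote by $p_\eta$ the unique preimage of $\eta \in Q$ by the restricted moment map, $p_\eta = (\pi|_{\overline{\mathcal{E}}})^{-1}(\eta)$. The $m$-th mixture of $\overline{\mathcal{E}}$ is parametrized by a mixture map in the following way: $\phi : D := \overline{\mathcal{P}}_m \times Q^m \to \overline{\mathcal{P}}$; $(\alpha, \eta_1, \ldots, \eta_m) \mapsto \sum_{i=1}^m \alpha_i p_{\eta_i}$. From $\mathrm{Mixt}^m(\overline{\mathcal{E}}) \supseteq \partial\mathcal{P} := \overline{\mathcal{P}} \setminus \mathcal{P}$ it follows that the restriction $\phi|_C : C := \partial(\overline{\mathcal{P}}_m \times Q^m) \to \partial\mathcal{P}$ is a continuous surjection. Consider the normal space of $\mathcal{E}$, which is given by $\mathcal{N} = \ker A$. For any $p \in \overline{\mathcal{P}}$ the linear model $\mathcal{N}_p := \{q \in \overline{\mathcal{P}} : p - q \in \mathcal{N}\}$ intersects $\overline{\mathcal{E}}$ at a unique point $p_{\mathcal{E}} \in \overline{\mathcal{E}} \cap \mathcal{N}_p$ (see [28, Theorem 2.16]). Hence $\overline{\mathcal{P}} = (\overline{\mathcal{E}} + \ker A) \cap \overline{\mathcal{P}} = \bigcup_{p \in \overline{\mathcal{E}}} \mathcal{N}_p$. For every $p \in \mathcal{P}$, $\mathcal{N}_p$ is a polytope of dimension $\dim \ker A$. In the present case $\dim \ker A = 2$. The boundary of $\mathcal{N}_p$ is contained in the boundary of $\mathcal{P}$. Now, for any $p \in \mathcal{E}_q$ we consider the subset $B_p := \{(\alpha, \eta_1, \ldots, \eta_m) \in D : \pi(\phi(\alpha, \eta_1, \ldots, \eta_m)) = \pi(p)\}$. This set is mapped by $\phi$ to all convex combinations of $m$ elements of $\mathcal{E}_q$ which have the same expectation parameter as $p$. We consider also $\partial B_p = B_p \cap (\overline{\mathcal{P}}_m \times (\partial Q)^m)$, which corresponds to the same kind of mixtures, but with mixture components from the boundary of $\mathcal{E}$. We have that $\partial B_p \to \partial\mathcal{N}_p$ is surjective and has degree $m!$ (this is the cardinality of the preimage of a regular value, which arises from the freedom to permute the mixture components). For $Q_q$ it is not difficult to see that $\partial B_p$ is parametrized by an angle, say $\gamma$, and that $\phi|_{\partial B_p}(\gamma)$ circulates $\partial\mathcal{N}_p$ twice. Using that $B_p$ is contractible, it follows that $\phi(B_p) = \mathcal{N}_p$ and $\overline{\mathrm{Mixt}^m(\mathcal{E})} = \overline{\mathcal{P}}$. For strictly positive distributions the claim follows from the fact that $\phi(B_p) \subseteq \overline{\mathcal{P}}$ and that the image of an $\epsilon$-retraction of $B_p$, $(1-\epsilon)(B_p - p) + p$, can be made such that it contains any $\delta$-retraction of $\mathcal{N}_p$, $(1-\delta)(\mathcal{N}_p - p) + p$. □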

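Details to Lemma 5. For completeness we provide here a more extensive characterization of $S$-sets. For any $m \in \mathbb{R}^{\mathcal{X}}$ let $m^{\pm}(x) := \max\{0, \pm m(x)\}$ and let $p^m := \prod_{x \in \mathcal{X}} (p(x))^{m(x)}$. We use the following results from [13, 29]: (I) A set $\mathcal{Y} \subseteq \mathcal{X}$ is $Q$-facial iff there exists one $p \in \overline{\mathcal{E}}$ with $\operatorname{supp}(p) = \mathcal{Y}$ iff $\operatorname{supp}(m^+) \subseteq \mathcal{Y} \Leftrightarrow \operatorname{supp}(m^-) \subseteq \mathcal{Y}$ for all $m \in \ker A$. (II) A distribution $p$ is contained in $\overline{\mathcal{E}}$ iff $p$ fulfills the equations $p^{m^+} - p^{m^-} = 0$ $\forall m \in \ker A$.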

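Let $T$ be an index set with a partition $\Lambda \cup \Lambda^c = T$. Let $\{A(\lambda, \mathcal{X}) := (A(\lambda, x))_{x \in \mathcal{X}} \in \mathbb{R}^{\mathcal{X}} : \lambda \in T\}$ be a basis of $\mathbb{R}^{\mathcal{X}}$ and let $\{A(\lambda, \mathcal{X}) : \lambda \in \Lambda\}$ contain $\mathbb{1}$. Consider an exponential family $\mathcal{E}$ with sufficient statistics $A(\Lambda, \mathcal{X}) = (A(\lambda, x))_{\lambda \in \Lambda, x \in \mathcal{X}}$. Given any $\mathcal{Y} \subseteq \mathcal{X}$ the following statements are equivalent: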
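(i) Every $p \in \overline{\mathcal{P}}$ with $\operatorname{supp}(p) = \mathcal{Y}$ is contained in $\overline{\mathcal{E}}$.

(ii) Every $p \in \overline{\mathcal{P}}$ with $\operatorname{supp}(p) \subseteq \mathcal{Y}$ is contained in $\overline{\mathcal{E}}$.

(iii) $\operatorname{supp}(m^+) \not\subseteq \mathcal{Y}$ $\forall m \in \ker A(\Lambda, \mathcal{X}) \setminus \{0\}$.

(iv) $\operatorname{rk} A(\Lambda^c, \mathcal{Y}^c) = |\Lambda^c|$ and $\mathcal{Y}$ is facial.

(v) $\operatorname{rk} A(\Lambda, \mathcal{Y}) = |\mathcal{Y}|$ and $\mathcal{Y}$ is facial.

(vi) Every $\mathcal{Y}' \subseteq \mathcal{Y}$ is facial. I.e., $\mathcal{Y}$ corresponds to a $(|\mathcal{Y}| - 1)$-simplex face of the convex support.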



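Item (iii) implies the equivalence of (i) and (ii). The claim (ii) if (iii) follows directly from (II). For the implication only if we have to show that if $\operatorname{supp}(m^+) \cap \mathcal{Y}^c = \emptyset$ for some $m \in \ker A(\Lambda, \mathcal{X}) \setminus \{0\}$, then there exists a $p \in \overline{\mathcal{P}}$ with $\operatorname{supp}(p) = \mathcal{Y}$ and $p^{n^+} - p^{n^-} \neq 0$ for some $n \in \ker A(\Lambda, \mathcal{X})$. Assume $\operatorname{supp}(m^+) \subseteq \mathcal{Y}$. If there exists one $p \in \overline{\mathcal{E}}$ with support $\mathcal{Y}$ (if none exists we are done), then $\mathcal{Y}$ is facial and $\operatorname{supp}(m^-) \subseteq \mathcal{Y}$. We write $(p(x))_{x : m(x) \neq 0} = (\xi, \eta) \in \mathbb{R}_+^{\operatorname{supp}(m^+)} \times \mathbb{R}_+^{\operatorname{supp}(m^-)}$. Here $|\operatorname{supp}(m^{\pm})| > 0$, since $0 = \langle A(\emptyset, \mathcal{X}), m \rangle = \sum_x m(x)$. Assume $\|\xi\|_1 < \|\eta\|_1$ (if this is not possible, again we are done). By (II), $\xi^{m^+} - \eta^{m^-} = 0$. Now consider a $\tilde{p}$ with $\tilde{p}(x) = p(x)$ $\forall x : m(x) = 0$, and $\tilde{\xi} = 2\xi$, and $\tilde{\eta} = (1 - \|\xi\|_1 / \|\eta\|_1)\,\eta$ in the other entries. We have $\|\tilde{\xi}\|_1 + \|\tilde{\eta}\|_1 = \|\xi\|_1 + \|\eta\|_1$ s.t. $\tilde{p} \in \overline{\mathcal{P}}$, and:

$$\tilde{\xi}^{m^+} - \tilde{\eta}^{m^-} = \left( 2^{\langle \mathbb{1}, m^+ \rangle} - (1 - \|\xi\|_1 / \|\eta\|_1)^{\langle \mathbb{1}, m^- \rangle} \right) \xi^{m^+} .$$

W.l.o.g. $\langle \mathbb{1}, m^{\pm} \rangle \geq 1$. Since $0 < \|\xi\|_1 / \|\eta\|_1 < 1$ and $\xi$ is greater than $0$ in every entry, $\tilde{\xi}^{m^+} - \tilde{\eta}^{m^-} \neq 0$.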

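(iv) iff (iii): It suffices to show: For a facial $\mathcal{Y}$ it is $\operatorname{rk} A(\Lambda^c, \mathcal{Y}^c) = |\Lambda^c|$ if and only if $\mathcal{Y}^c \cap \operatorname{supp}(m) \neq \emptyset$ $\forall m \in \ker A(\Lambda, \mathcal{X}) \setminus \{0\}$. Any $m \in \ker A(\Lambda, \mathcal{X})$ can be written as $m(x) = \langle \alpha, A(T, x) \rangle$, where $\operatorname{supp}(\alpha) \subseteq \Lambda^c$. For any $x \in \mathcal{X}$, $m(x) = \langle \alpha, A(T, x) \rangle = 0$ iff $\alpha \perp A(T, x)$. Hence, $\mathcal{Y}^c \cap \operatorname{supp}(m) = \emptyset$ is equivalent to the existence of some $\alpha \in \mathbb{R}^T$ such that $\alpha \perp A(T, x)$ $\forall x \in \mathcal{Y}^c$. These equations can't be satisfied for any $\alpha \neq 0$ with $\operatorname{supp}(\alpha) \subseteq \Lambda^c$ iff $\operatorname{rk} A(\Lambda^c, \mathcal{Y}^c) = |\Lambda^c|$.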

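(v) iff (iv): We show $\operatorname{rk} A(\Lambda, \mathcal{Y}) = \min\{|\mathcal{Y}|, |\Lambda|\}$ iff $\operatorname{rk} A(\Lambda^c, \mathcal{Y}^c) = \min\{|\mathcal{Y}^c|, |\Lambda^c|\}$. Consider first $|\mathcal{Y}| = |\Lambda|$. It suffices to show one direction, since one may define $\Lambda' = \Lambda^c$, $\mathcal{Y}' = \mathcal{Y}^c$. If $A(\Lambda, \mathcal{Y})$ has full rank, then there exist two invertible $|\Lambda| \times |\Lambda|$-matrices $L$ and $R$ such that $L A(\Lambda, \mathcal{Y}) R = I_{|\Lambda|}$. Now, multiplication of $A$ with the block diagonal concatenation of $L$ and $R$ with $I_{|\Lambda^c|}$ and appropriate row and column addition gives $\operatorname{diag}(I_{|\Lambda|}, A(\Lambda^c, \mathcal{Y}^c))$. The rank of this matrix is the same as that of $A$, and hence $\operatorname{rk} A(\Lambda^c, \mathcal{Y}^c) = |\Lambda^c|$. Consider now the case $|\mathcal{Y}| \neq |\Lambda|$. W.l.o.g. $|\mathcal{Y}| < |\Lambda|$ and $\operatorname{rk} A(\Lambda, \mathcal{Y}) = |\mathcal{Y}|$. Since $A(\Lambda, \mathcal{X})$ has full rank $|\Lambda|$, there exists a set $\tilde{\mathcal{Y}}$ s.t. $\mathcal{X} \supseteq \tilde{\mathcal{Y}} \supseteq \mathcal{Y}$, $|\tilde{\mathcal{Y}}| = |\Lambda|$ and $\operatorname{rk} A(\Lambda, \tilde{\mathcal{Y}}) = |\Lambda|$. From the first part we have that this is equivalent to $\operatorname{rk} A(\Lambda^c, \tilde{\mathcal{Y}}^c) = |\Lambda^c|$. But this implies $\operatorname{rk} A(\Lambda^c, \mathcal{Y}^c) = |\Lambda^c|$, since $\mathcal{Y}^c \supseteq \tilde{\mathcal{Y}}^c$. The other direction is analogous.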



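(vi) iff (v): $\operatorname{rk} A(\Lambda, \mathcal{Y}) = |\mathcal{Y}|$ is equivalent to $\{A(\Lambda, y)\}_{y \in \mathcal{Y}}$ being linearly independent, such that $\operatorname{conv}\{A(\Lambda, y)\}_{y \in \mathcal{Y}}$ is a $(|\mathcal{Y}| - 1)$-simplex. If $\mathcal{Y}$ is facial, then all sets $\mathcal{Y}' \subseteq \mathcal{Y}$ are facial. If $\operatorname{conv}\{A(\Lambda, y)\}_{y \in \mathcal{Y}}$ is a simplex face of $Q$, then $\{A(\Lambda, y)\}_{y \in \mathcal{Y}}$ are affinely independent and in fact linearly independent. □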

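Proof of Proposition. From Lemma 5 we have: $\mathcal{Y}$ is not an $S$-set $\Leftrightarrow$ $\exists m \in \ker A(\Lambda, \mathcal{X}) \setminus \{0\}$ such that $\operatorname{supp}(m^+) \subseteq \mathcal{Y}$. If $\exists m \in \ker A(\Lambda, \mathcal{X}) \setminus \{0\}$ such that $\operatorname{supp}(m^+) = \mathcal{Y}$, then $\mathcal{Y}$ is not facial (from fllUg). From flU we have that $\forall \lambda \subseteq [N]$ with $|\lambda| = k + 1$ and $\forall y \in \{0,1\}^{N-(k+1)}$ there exists an $m \in \ker A(\Lambda_k, \mathcal{X})$ such that $\operatorname{supp}(m) = \{x \in \mathcal{X} : x_{\lambda^c} = y\} =: C$ (which is a $(k+1)$-dimensional face of the $N$-cube), and $m|_C \propto A(\lambda, C)$. Now observe that $A(\lambda, x)$ is just the parity of $x_\lambda$, i.e., $A(\lambda, x)$ is $1$ if $x_\lambda$ has an even number of ones and $-1$ if $x_\lambda$ has an odd number of ones. This means that $\operatorname{supp}(m^+) = Z_+ \cap C = \{x \in \mathcal{X} : \sum_{i \in \lambda} x_i \bmod 2 = 0,\ x_{\lambda^c} = y\}$. Hence the claim. □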
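The kernel element used in this proof can be written down explicitly for small cases. The sketch below assumes that the sufficient statistics of the $k$-interaction family are the parity functions $A(\mu, x)$ for all $\mu \subseteq [N]$ with $|\mu| \leq k$, as the proof suggests; the helper names (`parity_stat`, `kernel_vector`, `in_kernel`, `positive_support`) are hypothetical and coordinates are indexed from 0.

```python
from itertools import combinations, product


def parity_stat(lam, x):
    """A(lam, x): +1 if x has an even number of ones on lam, else -1."""
    return -1 if sum(x[i] for i in lam) % 2 else 1


def kernel_vector(N, lam, y):
    """m supported on the cylinder C = {x : x restricted to the complement
    of lam equals y}, with m equal to A(lam, .) on C."""
    comp = [i for i in range(N) if i not in lam]
    m = {}
    for x in product((0, 1), repeat=N):
        on_cylinder = all(x[i] == yi for i, yi in zip(comp, y))
        m[x] = parity_stat(lam, x) if on_cylinder else 0
    return m


def in_kernel(m, N, k):
    """Check <A(mu, .), m> = 0 for every mu with |mu| <= k."""
    for size in range(k + 1):
        for mu in combinations(range(N), size):
            if sum(parity_stat(mu, x) * v for x, v in m.items()) != 0:
                return False
    return True


def positive_support(m):
    """supp(m^+): the points where m is strictly positive."""
    return {x for x, v in m.items() if v > 0}


if __name__ == "__main__":
    m = kernel_vector(3, (0, 1), (0,))  # N = 3, k = 1, |lam| = k + 1
    print(in_kernel(m, 3, 1))
    print(sorted(positive_support(m)))
```

In this example the positive support consists of the even-parity points of the cylinder, matching the description of $Z_+ \cap C$ in the proof.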

Proof of Lemma 10. The set $\mathcal{Y}$ has cardinality $2^{k+1}$ and therefore $F$ has dimension at most $2^{k+1} - 1 = 2K + 1$. In fact it has a dimension strictly less than $2K + 1$, since in that case it would be a simplex, in contradiction to Proposition 9. On the other hand, if the dimension of $F$ was less than $2K$ then, by [16, Theorem 7.4.3], it would be a simplex, in contradiction to the number of vertices. Hence, $\dim F = 2^{k+1} - 2$ and all proper faces of $F$ are simplices. The combinatorial equivalence of $F$ to the cyclic polytope $C(2K + 2, 2K)$ follows from the fact that any $2n$-dimensional, $n$-neighborly polytope with $v \leq 2n + 3$ vertices is combinatorially equivalent to the cyclic polytope $C(v, 2n)$ [16, Theorem 7.2.3]. To complete the proof we use Gale's Evenness Criterion: A $d$-tuple $V_J = \{x(t_j)\}_{j \in J}$, $J \subseteq [v]$, $|J| = d$, of vertices of $C(v, d)$ spans a facet iff between any two elements of $[v] \setminus J$ there is an even number of elements of $J$ [16, Theorem 4.7.2]. Here we have $v = 2^{k+1}$ and $d = 2^{k+1} - 2$. The combinatorial structure of the cyclic polytope is independent of the map $i \mapsto t_i$ and we may choose $I = [v] := \{1, \ldots, 2^{k+1}\} \subset \mathbb{N}$. The sets $V_J$, $|J| = 2^{k+1} - 2$, satisfying the evenness criterion are exactly the complements of pairs $\{i^e, i^o\} \subseteq [v]$, where $i^e$ is even and $i^o$ is odd. There are $2^{2k}$ such pairs, and hence as many facets. This is the same number of sets respecting the condition on $S$-sets from Proposition. Therefore, all sets $\mathcal{Z}$ respecting that condition, $\mathcal{Z} \not\supseteq \mathcal{Y} \cap Z_{\pm}$, must correspond to facets of $C(2^{k+1}, 2^{k+1} - 2)$ and are indeed $S$-sets. □
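The facet count in this proof can be confirmed by brute force for small $k$. The following sketch applies Gale's evenness criterion in its standard form (every two indices outside $J$ are separated by an even number of indices in $J$) to all $d$-subsets of $[v]$; the function names `is_facet` and `cyclic_facets` are hypothetical, and the enumeration is only feasible for small $v$.

```python
from itertools import combinations


def is_facet(J, v):
    """Gale's evenness criterion for the cyclic polytope C(v, d), d = |J|.

    J spans a facet iff every two indices of [v] that are not in J are
    separated by an even number of indices of J.
    """
    J = set(J)
    outside = [i for i in range(1, v + 1) if i not in J]
    for a, b in combinations(outside, 2):
        between = sum(1 for i in range(a + 1, b) if i in J)
        if between % 2 == 1:
            return False
    return True


def cyclic_facets(v, d):
    """All d-subsets of {1, ..., v} that satisfy the evenness criterion."""
    return [J for J in combinations(range(1, v + 1), d) if is_facet(J, v)]


if __name__ == "__main__":
    for k in (1, 2, 3):
        v = 2 ** (k + 1)
        facets = cyclic_facets(v, v - 2)
        print(k, len(facets), 2 ** (2 * k))
```

For $d = v - 2$ the complement of each facet is a pair, and the check reduces to the two missing indices having different parity, as stated in the proof.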

Proof of Proposition. For any $S$-set $\mathcal{Y}$ of $\mathcal{E}^k$ we have: $|(C \cap Z_{\pm}) \setminus \mathcal{Y}| \geq 1$ for any $(k+1)$-dimensional face $C$ of the $N$-cube. Therefore, the maximal cardinality of an $S$-set $\mathcal{Y} \subseteq Z_{\pm}$ is upper bounded by $|Z_{\pm}| - K(N, k+1)$, where $K(N, k+1)$ is the smallest number of elements needed to mark all $(k+1)$-dimensional faces of the $N$-cube. The set of vertices of all $(k+1)$-faces of the $N$-cube containing a common mark $x$ corresponds to the Hamming ball $B_{N,k+1}(x) \subseteq \mathcal{X}$ of radius $k + 1$ centered at $x$. Hence, $K(N, k+1)$ is the minimal cardinality of binary codes of length $N$ and covering radius $k + 1$. In the case $R < N \leq 2R + 1$, clearly $K(N, R) = 2$, but in general computing $K(N, R)$ is hard (see El). A crude lower bound is the sphere-covering bound: $K(N, R) \geq 2^N / |B_{N,R}|$, which is only optimal if the faces containing different marks can be chosen to be disjoint. Here $|B_{N,R}| = \sum_{i=0}^{R} \binom{N}{i}$.

On the other hand, the cardinality of an $S$-set of $\mathcal{E}^k$ can't exceed $\dim \mathcal{E}^k + 1 = |\Lambda_k| = |B_{N,k}|$, since the dimension of the corresponding face can't be larger than that of $Q_k$. □
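The two quantities from coding theory used here are easy to evaluate for small parameters. The sketch below computes the Hamming ball size and the resulting sphere-covering lower bound, and checks by enumeration whether a given code has covering radius at most $R$; the helper names are hypothetical, and the two-word code of the all-zeros and all-ones vectors is the usual example behind $K(N, R) = 2$ for $R < N \leq 2R + 1$.

```python
from itertools import product
from math import comb


def ball_size(N, R):
    """|B_{N,R}|: number of binary words within Hamming distance R of a word."""
    return sum(comb(N, i) for i in range(R + 1))


def sphere_covering_bound(N, R):
    """Ceiling of 2^N / |B_{N,R}|, a lower bound on K(N, R)."""
    return -(-(2 ** N) // ball_size(N, R))


def covers(code, N, R):
    """True iff every word of {0,1}^N is within distance R of some codeword."""
    for x in product((0, 1), repeat=N):
        nearest = min(sum(a != b for a, b in zip(x, c)) for c in code)
        if nearest > R:
            return False
    return True


if __name__ == "__main__":
    print(ball_size(4, 1), sphere_covering_bound(4, 1))
    repetition = [(0,) * 5, (1,) * 5]
    print(covers(repetition, 5, 2))  # N = 2R + 1
```

The enumeration in `covers` is exponential in $N$ and is meant only as a sanity check on small instances.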

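Proof of Proposition. The set $\mathcal{Y}$ is an $S$-set iff (i) $\operatorname{rk} A(\Lambda, \mathcal{Y}) = |\mathcal{Y}|$ (i.e. $\mathcal{Y}$ describes a $(|\mathcal{Y}| - 1)$-simplex), and (ii) $\exists c \in \mathbb{R}^{|\Lambda|}$ s.t. $\langle c, A(\Lambda, y) \rangle = 0$ $\forall y \in \mathcal{Y}$ and $\langle c, A(\Lambda, x) \rangle \geq 1$ $\forall x \in \mathcal{Y}^c$ (i.e. $\mathcal{Y}$ is facial). We show that $\mathcal{Y}$ has these properties iff $x * \mathcal{Y}$ does. We have that

$$A(\lambda, x * y) = (-1)^{|(\operatorname{supp}(x) \,\triangle\, \operatorname{supp}(y)) \cap \lambda|} = (-1)^{|\operatorname{supp}(x) \cap \lambda|} (-1)^{|\operatorname{supp}(y) \cap \lambda|} , \quad \forall x \in \mathcal{X},\ \lambda \in 2^{[N]},\ y \in \mathcal{X} ,$$

and thus $A(\Lambda, x * y) = \operatorname{diag}(A(\Lambda, x)) \cdot A(\Lambda, y)$. Hence, $\operatorname{rk} A(\Lambda, \mathcal{Y}) = \operatorname{rk} A(\Lambda, x * \mathcal{Y})$. On the other hand we can define $\tilde{c} := \operatorname{diag}(A(\Lambda, x)) \cdot c$, and we get $\langle \tilde{c}, A(\Lambda, x * y) \rangle = \langle c, A(\Lambda, y) \rangle = 0$ $\forall y \in \mathcal{Y}$. Similarly, $\langle \tilde{c}, A(\Lambda, z) \rangle \geq 1$ $\forall z \in (x * \mathcal{Y})^c$. □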
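The multiplicative identity at the heart of this proof can be verified exhaustively for small $N$. The sketch assumes, in line with the displayed formula, that $x * y$ is the entrywise sum modulo 2 (so that its support is the symmetric difference of the supports); `parity_stat`, `star` and `identity_holds` are hypothetical names and coordinates are indexed from 0.

```python
from itertools import combinations, product


def parity_stat(lam, x):
    """A(lam, x) = (-1)^(number of ones of x on the coordinates in lam)."""
    return -1 if sum(x[i] for i in lam) % 2 else 1


def star(x, y):
    """Entrywise addition modulo 2 (assumed meaning of x * y)."""
    return tuple(a ^ b for a, b in zip(x, y))


def identity_holds(N):
    """Check A(lam, x*y) == A(lam, x) * A(lam, y) for all lam, x, y."""
    words = list(product((0, 1), repeat=N))
    subsets = [lam for r in range(N + 1) for lam in combinations(range(N), r)]
    return all(
        parity_stat(lam, star(x, y)) == parity_stat(lam, x) * parity_stat(lam, y)
        for lam in subsets
        for x in words
        for y in words
    )


if __name__ == "__main__":
    print(identity_holds(3))
```

Because each row of $A(\Lambda, \cdot)$ is multiplied by a sign $\pm 1$, the map $y \mapsto x * y$ preserves ranks, which is the step used for property (i).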